Cocaine seizures:

## 'data.frame':    3380 obs. of  5 variables:
##  $ state  : chr  "WA" "CT" "FL" "OH" ...
##  $ potency: num  77 51 68 69 75 73 54 58 77 49 ...
##  $ weight : num  217 248 43 123 118 127 50 140 127 74 ...
##  $ month  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price  : num  5000 4800 3500 3500 3400 3000 3000 2800 2600 2600 ...
##     state              potency          weight           month       
##  Length:3380        Min.   : 2.00   Min.   :  1.00   Min.   : 1.000  
##  Class :character   1st Qu.:49.00   1st Qu.:  3.00   1st Qu.: 3.000  
##  Mode  :character   Median :63.00   Median : 11.00   Median : 7.000  
##                     Mean   :62.05   Mean   : 23.72   Mean   : 6.414  
##                     3rd Qu.:76.00   3rd Qu.: 29.00   3rd Qu.: 9.000  
##                     Max.   :98.00   Max.   :505.00   Max.   :12.000  
##      price       
##  Min.   :  10.0  
##  1st Qu.: 200.0  
##  Median : 500.0  
##  Mean   : 814.3  
##  3rd Qu.:1100.0  
##  Max.   :9000.0
  1. Here we will only consider seizures with weight below 200 grams. Define a new data frame cocaine2 accordingly. Hint: You can use the command subset() to select only part of the data set.
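A minimal sketch of this step, assuming the original data frame is called cocaine (the name is not given in the text):

```r
# Keep only the seizures lighter than 200 grams
# (assumes the full data set is stored in a data frame named cocaine)
cocaine2 <- subset(cocaine, weight < 200)
```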

  2. In which three states do seizures happen most frequently? Investigate this graphically. Hint: Create a barplot of the variable state.
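One possible way to investigate this graphically, sorting the bars so the most frequent states are easy to read off (cocaine2 is the subset from task 1):

```r
# Bar chart of seizure counts per state, largest first; las = 2 rotates labels
barplot(sort(table(cocaine2$state), decreasing = TRUE), las = 2)
```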

# Florida, New York, Virginia
  3. Draw a histogram of the variable weight. Is there a tendency for seizures with smaller weight to happen more often?
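A sketch of the histogram with ggplot2 (the binwidth of 10 grams is an arbitrary choice):

```r
library(ggplot2)

# Histogram of seizure weights; small weights should dominate if the
# tendency described in the question holds
ggplot(cocaine2, aes(x = weight)) +
  geom_histogram(binwidth = 10)
```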

# Yes, there is a clear tendency for seizures of smaller weight to occur more frequently.
  4. What influences the price of cocaine? Hint: Explore the relation between the variables potency, weight and price using simple scatterplots.
# Price in relation to weight:

# Price in relation to potency:
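The two scatterplots could be produced along these lines (a sketch; the smoother is optional but makes the trend easier to see):

```r
library(ggplot2)

# Price against weight
ggplot(cocaine2, aes(x = weight, y = price)) +
  geom_point() +
  geom_smooth()

# Price against potency
ggplot(cocaine2, aes(x = potency, y = price)) +
  geom_point() +
  geom_smooth()
```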

  5. BONUS: We can even visualize the three variables together in one plot. Create a scatterplot of the variable price against the variable potency. Color the points according to the variable weight. Beautify the plot by making the points half-transparent. Finally, it might be useful to log-transform the color scale. Hint: Use the argument color = … of the function aes(…) to color the points. Furthermore, try the functions … + geom_point(alpha = …) + scale_color_gradient(trans = …) for a nicer visualization.
# Not log transformed 

# Log transformed
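The log-transformed version of the bonus plot might look like this sketch (note that in recent ggplot2 releases the argument trans is spelled transform, so one of the two may produce a deprecation warning depending on your version):

```r
library(ggplot2)

# Price vs. potency, colored by weight, half-transparent points,
# log-transformed color scale
ggplot(cocaine2, aes(x = potency, y = price, color = weight)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(trans = "log")
```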

  6. Plot potency against weight. Add a smoother to the scatterplot. What do you notice? Hint: Use the function … + geom_smooth() for smoothing.
# Potency against weight (potency on x axis)

# Potency against weight (potency on y axis)
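The version with potency on the y axis could be sketched as:

```r
library(ggplot2)

# Potency against weight with a smoother; a flat smoother would indicate
# that weight has little influence on potency
ggplot(cocaine2, aes(x = weight, y = potency)) +
  geom_point() +
  geom_smooth()
```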

# Again the graph shows no clear correlation: weight does not appear to affect potency.

Flights data.

The aim of this exercise is to check whether there is a relation between the average arrival delay and the departure time of planes. Load the package nycflights13, which contains the on-time data flights, using the command require(nycflights13). The flights data set covers all flights departing from one of the airports of New York in 2013.

In particular, the interest lies in the following variables:

- hour, minute: the hour and minute of the departure
- arr_delay: the arrival delay of the incoming plane (in minutes)
- dest: the destination

  1. Let’s look at the average arrival delay for a given departure time of the day (hour and minute). For this purpose create a new variable which encodes a given hour and minute as one decimal number and call this new variable time. Thereafter calculate the average arrival delay per value of the variable time and save it in a new data frame named delay.per.hour. Hint: First create the new variable time with the command flights$time <- flights$hour + flights$minute / 60

Use the following function call in order to calculate the average delay per value of time: aggregate(formula = arr_delay ~ time, data = flights, FUN = …, na.rm = …)
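Putting the two hints together, this step could look like the following sketch:

```r
require(nycflights13)

# Encode departure time as a decimal number, e.g. 14:30 becomes 14.5
flights$time <- flights$hour + flights$minute / 60

# Average arrival delay per departure time; na.rm = TRUE is passed on to mean()
delay.per.hour <- aggregate(formula = arr_delay ~ time, data = flights,
                            FUN = mean, na.rm = TRUE)
```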

  2. Plot the average arrival delay against time. What do you conclude?
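A sketch of the plot, using the delay.per.hour data frame from task 1:

```r
library(ggplot2)

# Average arrival delay as a function of departure time
ggplot(delay.per.hour, aes(x = time, y = arr_delay)) +
  geom_point() +
  geom_smooth()
```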

The plot shows a gap in the data points between midnight and 5 a.m. This makes sense, as most flights depart around 8 a.m. and in the late afternoon/early evening; the histogram confirms this. Most delays occur before 10 a.m., with a peak around 8 a.m.

  3. Scale the points in the plot by the number of planes n which departed at a particular time of the day. The variable n needs to be calculated and added to the data set delay.per.hour which you defined in task 1. Why is this plot more informative than the one of the previous subtask? Hint: The variable n can be calculated as in task 1 if you slightly change the function call of aggregate(). Use the argument FUN = length.

The counts were divided by 100 to make the scaled points easier to distinguish in the plot.
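A sketch of this step (the division by 100 is purely cosmetic, as noted above):

```r
library(ggplot2)

# Number of flights per departure time: length() counts the rows per group
delay.per.hour$n <- aggregate(formula = arr_delay ~ time, data = flights,
                              FUN = length)[, 2]

# Same plot as before, with point size scaled by the (downscaled) flight count
ggplot(delay.per.hour, aes(x = time, y = arr_delay, size = n / 100)) +
  geom_point()
```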

# The second plot is more informative because it shows the number of flights in addition to the delays.

Flights data, continued.

The goal is to explore if there are large differences between destinations regarding arrival delay and number of flights. We work again with the flights data set in the package nycflights13 from Exercise 2. If you need to reload the data set, use the command require(nycflights13).

  1. Calculate the average value of the arrival delay arr_delay for each destination (dest). Omit all the missing values in the calculation. Hint: Use the function aggregate(). The argument na.rm = TRUE of the function mean() allows you to omit missing values in the calculation of the mean. Note that the function aggregate() creates a data frame whose first column corresponds to the grouping variable (here dest). Save the output of the function aggregate() as a new data frame delay.per.dest.
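This step could be sketched as:

```r
require(nycflights13)

# Average arrival delay per destination; first column of the result is dest
delay.per.dest <- aggregate(formula = arr_delay ~ dest, data = flights,
                            FUN = mean, na.rm = TRUE)
```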

  2. Calculate the number of planes departing to each destination. Add those counts as variable n to the data frame delay.per.dest. Hint: Use again aggregate() but only save the second column of the output.
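Following the hint, only the second column of the aggregate() output is kept:

```r
# Number of flights per destination; [, 2] drops the grouping column
delay.per.dest$n <- aggregate(formula = arr_delay ~ dest, data = flights,
                              FUN = length)[, 2]
```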

  3. Merge the data frames delay.per.dest and airports in order to add the coordinates (lon, lat) of the airports to delay.per.dest. The data frame airports is included in the package nycflights13. Hint: Use the function merge(x = delay.per.dest, y = …, by.x = "dest", by.y = "faa", all.x = TRUE, all.y = FALSE). Look at the help file of the function merge() by typing ?merge to understand what the different arguments mean.
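Filling in the hint, the merge could look like this; all.x = TRUE keeps every destination even if its FAA code is missing from airports:

```r
# Attach airport coordinates (lon, lat) to the per-destination summary
delay.per.dest <- merge(x = delay.per.dest, y = airports,
                        by.x = "dest", by.y = "faa",
                        all.x = TRUE, all.y = FALSE)
```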

  4. Create a scatterplot of the latitude against the longitude and scale the points according to the number of departing planes. Hint: Use the argument size = … in the function aes().

## Warning: Removed 4 rows containing missing values (geom_point).

  5. Moreover, color the points by the value of the average arrival delay. What do you notice?
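A sketch covering both task 4 and task 5 at once (destinations without coordinates in airports are dropped from the plot, which explains the warning above):

```r
library(ggplot2)

# Map-like scatterplot: point size = number of departing planes,
# color = average arrival delay
ggplot(delay.per.dest, aes(x = lon, y = lat, size = n, color = arr_delay)) +
  geom_point(alpha = 0.7)
```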

# Larger airports tend to have average delays close to zero, whereas smaller airports tend to have either a high occurrence of delays or a high occurrence of flights arriving early.

Gapminder: Fact-based world view.

The Gapminder Foundation wants to give access to a fact-based world view in order to promote sustainable global development. For more information and entertaining videos, see http://www.gapminder.org/. The aim of this exercise is to obtain a nice visualization of life expectancy vs. GDP per capita. This is achieved by successively adding more functions to a basic function call. Load the package gapminder, which contains the data set gapminder and the vector country_colors, using the command require(gapminder, quietly = TRUE). Take a first look at the data by reading the help files, typing the commands ?gapminder, ?country_colors and str(gapminder). Consider first the data set gapminder.

  1. Let’s pick one particular year of the data set gapminder. Use the function subset() in order to extract the observations of the year 2002. Create a scatterplot of lifeExp against gdpPercap.
  2. Use the variable country in order to color the points of the plot. Scale the points by the square root of the variable pop and omit the legend. Hint: Use the function … + geom_point(aes(size = …), pch = 21, show_guide = …)
  3. Reproduce the same plot using a log-scale for the x-axis.
  4. Now make the size of the points a bit larger. The size of the points should range from 1 to 40. Hint: Use the function …+ scale_size_continuous(range = …)
  5. Color the points in a way that you can distinguish between the different continents. Consider the vector country_colors which provides a color encoding for the continents. Hint: Use the function … + scale_fill_manual(values = …)
  6. Use facetting to create a separate plot for each continent. Hint: Use the function … + facet_grid( . ~ …)
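Chaining all six steps together might look like the following sketch. Note that the hint's show_guide argument is deprecated in current ggplot2 in favor of show.legend, which is used here; pch = 21 makes country_colors act on the fill aesthetic:

```r
require(gapminder, quietly = TRUE)
library(ggplot2)

# Step 1: observations of the year 2002 only
gm2002 <- subset(gapminder, year == 2002)

ggplot(gm2002, aes(x = gdpPercap, y = lifeExp, fill = country)) +
  # Step 2: filled circles scaled by sqrt(pop), no legend
  geom_point(aes(size = sqrt(pop)), pch = 21, show.legend = FALSE) +
  # Step 3: log-scale for the x-axis
  scale_x_log10() +
  # Step 4: larger point sizes
  scale_size_continuous(range = c(1, 40)) +
  # Step 5: continent-based color encoding
  scale_fill_manual(values = country_colors) +
  # Step 6: one panel per continent
  facet_grid(. ~ continent)
```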